Goto

Collaborating Authors

 task 2


Interactive AI NPCs Powered by LLMs: Technical Report for the CPDC Challenge 2025

Huang, Yitian, Lei, Yuxuan, Lian, Jianxun, Liao, Hao

arXiv.org Artificial Intelligence

This report presents the solution and results of our team MSRA\_SC in the Commonsense Persona-Grounded Dialogue Challenge (CPDC 2025). We propose a simple yet effective framework that unifies improvements across both GPU Track and API Track. Our method centers on two key components. First, Context Engineering applies dynamic tool pruning and persona clipping for input compression, combined with post-processing techniques such as parameter normalization and function merging. Together with manually refined prompts, this design improves tool call stability, execution reliability, and role-playing guidance. Second, in the GPU Track, we further adopt GRPO training, replacing supervised fine-tuning with reinforcement learning directly optimized by reward signals. This mitigates small-sample overfitting and significantly enhances task-oriented dialogue performance. In the final evaluation, our team ranks 1st in Task 2 API, 2nd in Task 1 API, and 3rd in both Task 3 API and GPU track, demonstrating the effectiveness of our approach. Our code is publicly available at https://gitlab.aicrowd.com/nikoo_yu/cpdc-2025-winning-solution


Embedding Generative AI into Systems Analysis and Design Curriculum: Framework, Case Study, and Cross-Campus Empirical Evidence

Elkhodr, Mahmoud, Gide, Ergun

arXiv.org Artificial Intelligence

Systems analysis students increasingly use Generative AI, yet current pedagogy lacks systematic approaches for teaching responsible AI orchestration that fosters critical thinking whilst meeting educational outcomes. Students risk accepting AI suggestions blindly or uncritically without assessing alignment with user needs or contextual appropriateness. SAGE (Structured AI-Guided Education) addresses this gap by embedding GenAI into curriculum design, training students when to accept, modify, or reject AI contributions. Implementation with 18 student groups across four Australian universities revealed how orchestration skills develop. Most groups (84\%) moved beyond passive acceptance, showing selective judgment, yet none proactively identified gaps overlooked by both human and AI analysis, indicating a competency ceiling. Students strong at explaining decisions also performed well at integrating sources, and those with deep domain understanding consistently considered accessibility considerations. Accessibility awareness proved fragile. When writing requirements, 85\% of groups explicitly considered elderly users and cultural needs. Notably, 55\% of groups struggled identifying when AI misclassified system boundaries (what belongs inside versus outside the system), 45\% missed data management errors (how information is stored and updated), and 55\% overlooked missing exception handling. Three implications emerge for educators: (i) require students to document why they accepted, modified, or rejected each AI suggestion, making reasoning explicit; (ii) embed accessibility prompts at each development stage because awareness collapses without continuous scaffolding; and (iii) have students create their own specifications before using AI, then compare versions, and anchor to research or standards to identify gaps.


Dialog-based Language Learning

Jason E. Weston

Neural Information Processing Systems

A long-term goal of machine learning research is to build an intelligent dialog agent. Most research in natural language understanding has focused on learning from fixed training sets of labeled data, with supervision either at the word level (tagging, parsing tasks) or sentence level (question answering, machine translation). This kind of supervision is not realistic of how humans learn, where language is both learned by, and used for, communication. In this work, we study dialog-based language learning, where supervision is given naturally and implicitly in the response of the dialog partner during the conversation. We study this setup in two domains: the bAbI dataset of [23] and large-scale question answering from [3]. We evaluate a set of baseline learning strategies on these tasks, and show that a novel model incorporating predictive lookahead is a promising approach for learning from a teacher's response. In particular, a surprising result is that it can learn to answer questions correctly without any reward-based supervision at all.




Ideal

Neural Information Processing Systems

The analysis of spatio-temporal sequences plays an important role in many real-world applications, demanding a high model capacity to capture the interdependence among spatial and temporal dimensions.



MedM2T: A MultiModal Framework for Time-Aware Modeling with Electronic Health Record and Electrocardiogram Data

Kuo, Yu-Chen, Tseng, Yi-Ju

arXiv.org Artificial Intelligence

The inherent multimodality and heterogeneous temporal structures of medical data pose significant challenges for modeling. We propose MedM2T, a time-aware multimodal framework designed to address these complexities. MedM2T integrates: (i) Sparse Time Series Encoder to flexibly handle irregular and sparse time series, (ii) Hierarchical Time-Aware Fusion to capture both micro- and macro-temporal patterns from multiple dense time series, such as ECGs, and (iii) Bi-Modal Attention to extract cross-modal interactions, which can be extended to any number of modalities. To mitigate granularity gaps between modalities, MedM2T uses modality-specific pre-trained encoders and aligns resulting features within a shared encoder. We evaluated MedM2T on MIMIC-IV and MIMIC-IV-ECG datasets for three tasks that encompass chronic and acute disease dynamics: 90-day cardiovascular disease (CVD) prediction, in-hospital mortality prediction, and ICU length-of-stay (LOS) regression. MedM2T outperformed state-of-the-art multimodal learning frameworks and existing time series models, achieving an AUROC of 0.947 and an AUPRC of 0.706 for CVD prediction; an AUROC of 0.901 and an AUPRC of 0.558 for mortality prediction; and Mean Absolute Error (MAE) of 2.31 for LOS regression. These results highlight the robustness and broad applicability of MedM2T, positioning it as a promising tool in clinical prediction. We provide the implementation of MedM2T at https://github.com/DHLab-TSENG/MedM2T.


SEER: The Span-based Emotion Evidence Retrieval Benchmark

Sampath, Aneesha, Aran, Oya, Provost, Emily Mower

arXiv.org Artificial Intelligence

We introduce the SEER (Span-based Emotion Evidence Retrieval) Benchmark to test Large Language Models' (LLMs) ability to identify the specific spans of text that express emotion. Unlike traditional emotion recognition tasks that assign a single label to an entire sentence, SEER targets the underexplored task of emotion evidence detection: pinpointing which exact phrases convey emotion. This span-level approach is crucial for applications like empathetic dialogue and clinical support, which need to know how emotion is expressed, not just what the emotion is. SEER includes two tasks: identifying emotion evidence within a single sentence, and identifying evidence across a short passage of five consecutive sentences. It contains new annotations for both emotion and emotion evidence on 1200 real-world sentences. We evaluate 14 open-source LLMs and find that, while some models approach average human performance on single-sentence inputs, their accuracy degrades in longer passages. Our error analysis reveals key failure modes, including overreliance on emotion keywords and false positives in neutral text.


Cracking CodeWhisperer: Analyzing Developers' Interactions and Patterns During Programming Tasks

Javahar, Jeena, Budhrani, Tanya, Basha, Manaal, de Souza, Cleidson R. B., Beschastnikh, Ivan, Rodriguez-Perez, Gema

arXiv.org Artificial Intelligence

Abstract--The use of AI code-generation tools is becoming increasingly common, making it important to understand how software developers are adopting these tools. In this study, we investigate how developers engage with Amazon's Code-Whisperer, an LLM-based code-generation tool. We conducted two user studies with two groups of 10 participants each, interacting with CodeWhisperer - the first to understand which interactions were critical to capture and the second to collect low-level interaction data using a custom telemetry plugin. Our mixed-methods analysis identified four behavioral patterns: 1) incremental code refinement, 2) explicit instruction using natural language comments, 3) baseline structuring with model suggestions, and 4) integrative use with external sources. We provide a comprehensive analysis of these patterns . Several IDE-based code generation tools have been released in the past few years, such as GitHub's Copilot [8], Kite [14], Amazon's Code Whisperer [20], Tabnine [22], and WPCode [28]. Research reveals that being able to achieve their full potential requires a certain level of guidance to ensure that the tool's output aligns with the user's goal [21].